library(tidyverse) # for graphing and data cleaning
library(googlesheets4) # for reading googlesheet data
library(lubridate) # for date manipulation
library(ggthemes) # for even more plotting themes
library(geofacet) # for special faceting with US map layout
gs4_deauth() # To not have to authorize each time you knit.
theme_set(theme_minimal()) # My favorite ggplot() theme :)
#Lisa's garden data
garden_harvest <- read_sheet("https://docs.google.com/spreadsheets/d/1DekSazCzKqPS2jnGhKue7tLxRU3GVL1oxi-4bEM5IWw/edit?usp=sharing") %>%
mutate(date = ymd(date))
# Seeds/plants (and other garden supply) costs
supply_costs <- read_sheet("https://docs.google.com/spreadsheets/d/1dPVHwZgR9BxpigbHLnA0U99TtVHHQtUzNB9UR0wvb7o/edit?usp=sharing",
col_types = "ccccnn")
# Planting dates and locations
plant_date_loc <- read_sheet("https://docs.google.com/spreadsheets/d/11YH0NtXQTncQbUse5wOsTtLSKAiNogjUA21jnX5Pnl4/edit?usp=sharing",
col_types = "cccnDlc")%>%
mutate(date = ymd(date))
# Tidy Tuesday data
kids <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-15/kids.csv')
Before starting your assignment, you need to get yourself set up on GitHub and make sure GitHub is connected to R Studio. To do that, you should read the instruction (through the “Cloning a repo” section) and watch the video here. Then, do the following (if you get stuck on a step, don’t worry, I will help! You can always get started on the homework and we can figure out the GitHub piece later):
keep_md: TRUE in the YAML heading. The .md file is a markdown (NOT R Markdown) file that is an interim step to creating the html file. They are displayed fairly nicely in GitHub, so we want to keep it and look at it there. Click the boxes next to these two files, commit changes (remember to include a commit message), and push them (green up arrow).Put your name at the top of the document.
For ALL graphs, you should include appropriate labels.
Feel free to change the default theme, which I currently have set to theme_minimal().
Use good coding practice. Read the short sections on good code with pipes and ggplot2. This is part of your grade!
When you are finished with ALL the exercises, uncomment the options at the top so your document looks nicer. Don’t do it before then, or else you might miss some important warnings and messages.
These exercises will reiterate what you learned in the “Expanding the data wrangling toolkit” tutorial. If you haven’t gone through the tutorial yet, you should do that first.
garden_harvest data to find the total harvest weight in pounds for each vegetable and day of week. Display the results so that the vegetables are rows but the days of the week are columns.garden_harvest%>%
mutate(day_of_week = wday(date, label = TRUE))%>%
group_by(vegetable, day_of_week)%>%
summarise(total_weight_lbs = sum(weight)/454)%>%
pivot_wider(id_cols = vegetable,
names_from = day_of_week,
values_from = total_weight_lbs)
garden_harvest data to find the total harvest in pound for each vegetable variety and then try adding the plot variable from the plant_date_loc table. This will not turn out perfectly. What is the problem? How might you fix it?garden_harvest%>%
group_by(vegetable)%>%
summarise(total_weight_lbs = sum(weight)/454)%>%
left_join(plant_date_loc,
by = "vegetable")
The total weight does not vary by plot. For example, with peas, there are Super Sugar Snap and Magnolia Blossom peas. Sugar Snaps were planted in A, Magnolia Blossoms in B. Before joining the two datasets, peas have a total harvest weight of 17.01 pounds. After joining, both Sugar Snap and Magnolia Blossom peas have a total wait of 17.01 pounds each- we don’t know how much came from plot A or plot B. This could be fixed if we summarized by variety instead of by vegetable.
garden_harvest and supply_cost datasets, along with data from somewhere like this to answer this question. You can answer this in words, referencing various join functions. You don’t need R code but could provide some if it’s helpful.I would start by creating a new dataset with the Whole Foods prices for each vegetable, by weight (if applicable). I would join this dataset with the supply costs dataset, calling it vegetable costs. Then, I would join this dataset with the garden harvest dataset. I would make a new variable that would track the total weight harvested by vegetable. Then, I would summarize the price saved by vegetable: Whole Foods price*pounds harvested-sum(seed price per vegetable). Generally, I think my code would look like:
veg_costs<- supply_costs%>% left_join(Whole_Foods, by = “vegetable”)
garden_harvest%>% left_join(veg_costs, by = “vegetable”)%>% group_by(vegetable)%>% mutate(total_weight_lbs = sum(weight)/454, total_seeds = sum(price))%>% summarize(total_saved = wholefoodsprice*total_weight_lbs - total_seeds)
garden_harvest%>%
filter(vegetable == "tomatoes")%>%
mutate(variety2 = fct_reorder(variety, date))%>%
group_by(variety2)%>%
summarise(total_weight_lbs = sum(weight)/454)%>%
ggplot(aes(y = variety2, x=total_weight_lbs))+
geom_col()+
labs(title = "Tomato varieties by size of first harvest date", x = "weight of first harvest", y = "variety")
garden_harvest data, create two new variables: one that makes the varieties lowercase and another that finds the length of the variety name. Arrange the data by vegetable and length of variety name (smallest to largest), with one row for each vegetable variety. HINT: use str_to_lower(), str_length(), and distinct().garden_harvest%>%
mutate(lower_variety = str_to_lower(variety),
length_variety = str_length(lower_variety))%>%
arrange(vegetable, length_variety)%>%
distinct(vegetable, variety, .keep_all = TRUE)
garden_harvest data, find all distinct vegetable varieties that have “er” or “ar” in their name. HINT: str_detect() with an “or” statement (use the | for “or”) and distinct().garden_harvest%>%
mutate(lower_variety = str_to_lower(variety),
has_er = str_detect(lower_variety, "er"),
has_ar = str_detect(lower_variety, "ar"))%>%
filter(has_er == TRUE | has_ar == TRUE)%>%
distinct(vegetable, variety, .keep_all = TRUE)
In this activity, you’ll examine some factors that may influence the use of bicycles in a bike-renting program. The data come from Washington, DC and cover the last quarter of 2014.
{300px}
{300px}
Two data tables are available:
Trips contains records of individual rentalsStations gives the locations of the bike rental stationsHere is the code to read in the data. We do this a little differently than usualy, which is why it is included here rather than at the top of this file. To avoid repeatedly re-reading the files, start the data import chunk with {r cache = TRUE} rather than the usual {r}.
data_site <-
"https://www.macalester.edu/~dshuman1/data/112/2014-Q4-Trips-History-Data.rds"
Trips <- readRDS(gzcon(url(data_site)))
Stations<-read_csv("http://www.macalester.edu/~dshuman1/data/112/DC-Stations.csv")
NOTE: The Trips data table is a random subset of 10,000 trips from the full quarterly data. Start with this small data table to develop your analysis commands. When you have this working well, you should access the full data set of more than 600,000 events by removing -Small from the name of the data_site.
It’s natural to expect that bikes are rented more at some times of day, some days of the week, some months of the year than others. The variable sdate gives the time (including the date) that the rental started. Make the following plots and interpret them:
sdate. Use geom_density().ggplot(data = Trips, aes(x = sdate))+
geom_density()+
labs(title = "Density of events by start date", x = "start date")
More events start in October than in November or December, and more events start in November than in December. By month, events seem to happen most at the very beginning, in the very middle, and at the very end of the month- each month alone seems to be trimodal.
mutate() with lubridate’s hour() and minute() functions to extract the hour of the day and minute within the hour from sdate. Hint: A minute is 1/60 of an hour, so create a variable where 3:30 is 3.5 and 3:45 is 3.75.Trips%>%
mutate(hour_num = hour(sdate),
minute_dec = minute(sdate)/60,
totaltime = hour_num + minute_dec)%>%
ggplot(aes(x=totaltime))+
geom_density()+
labs(title = "Density of events by time of day", x = "time of day", y = "density")
This distribution looks bimodal, with maximums at 8:00 and 18:00. It is most common for trips to start fairly early in the morning or in the early evening, about the time one might need to rent a bike to go to work and the time one might need to rent a bike to return home from work.
Trips%>%
mutate(day_of_week = wday(sdate, label = TRUE))%>%
ggplot(aes(y=day_of_week))+
geom_bar()+
labs(title = "Number of events by day of week", x = "number of events", y = "day of week")
This graph confirmed my suspicions about commuter rentals, as it shows weekdays (M-F) having the highest counts for events when compared to weekends (Saturday and Sunday). If the rentals were more for leisure, I would assume most of the rentals would occur on weekend days.
Trips%>%
mutate(hour_num = hour(sdate),
minute_dec = minute(sdate)/60,
totaltime = hour_num + minute_dec,
day_of_week = wday(sdate, label = TRUE))%>%
ggplot(aes(x=totaltime))+
geom_density()+
facet_wrap(vars(day_of_week))+
labs(title = "Density of events by time of day", x = "time of day", y = "density")
On weekdays (Monday-Friday) the event starting time distribution is bimodal, with modes at about 8:00 and 17:30. On weekends (Saturday and Sunday) the distribution is normal, with a mean start time of about 13:00. This makes sense- weekdays chart bike commuter rent times, weekends track leisurely fun bike rentals.
The variable client describes whether the renter is a regular user (level Registered) or has not joined the bike-rental organization (Causal). The next set of exercises investigate whether these two different categories of users show different rental behavior and how client interacts with the patterns you found in the previous exercises. Repeat the graphic from Exercise @ref(exr:exr-temp) (d) with the following changes:
fill aesthetic for geom_density() to the client variable. You should also set alpha = .5 for transparency and color=NA to suppress the outline of the density function.Trips%>%
mutate(hour_num = hour(sdate),
minute_dec = minute(sdate)/60,
totaltime = hour_num + minute_dec,
day_of_week = wday(sdate, label = TRUE))%>%
ggplot(aes(x=totaltime, fill = client))+
geom_density(alpha = .5, color = "NA")+
facet_wrap(vars(day_of_week))+
labs(title = "Density of events by time of day", x = "time of day", y = "density")
From this graph, we can see that the commuter vs leisure rental hypothesis holds up. On weekend days, both registered and unregistered users tend to rent bikes between 10:00 and 17:00, peak times for more leisurely activities. On weekdays, registered users rent around morning and evening commute times, while casual renters still rent during more “leisurely” times.
position = position_stack() to geom_density(). In your opinion, is this better or worse in terms of telling a story? What are the advantages/disadvantages of each?Trips%>%
mutate(hour_num = hour(sdate),
minute_dec = minute(sdate)/60,
totaltime = hour_num + minute_dec,
day_of_week = wday(sdate, label = TRUE))%>%
ggplot(aes(x=totaltime, fill = client))+
geom_density(position = position_stack(), alpha = .5, color = "NA")+
facet_wrap(vars(day_of_week))+
labs(title = "Density of events by time of day", x = "time of day", y = "density")
Personally, I do not enjoy this graph as much. At first glance, because casual and registered clients are stacked, it looks like there both casual users and registered users follow a bimodal trend on weekdays. From the previous graph I know this is not true, but at first glance that can be quite confusing. This does show there are many more registered users than casual users in the morning commute times, but the way it is presented isn’t as clear in my opinion.
weekend which will be “weekend” if the day is Saturday or Sunday and “weekday” otherwise (HINT: use the ifelse() function and the wday() function from lubridate). Then, update the graph from the previous problem by faceting on the new weekend variable.Trips%>%
mutate(hour_num = hour(sdate),
minute_dec = minute(sdate)/60,
totaltime = hour_num + minute_dec,
day_of_week = wday(sdate, label = TRUE),
weekend = ifelse(day_of_week == "Sat" | day_of_week == "Sun", "weekend", "weekday"))%>%
ggplot(aes(x=totaltime, fill = client))+
geom_density(alpha = .5, color = "NA")+
facet_wrap(vars(weekend))+
labs(title = "Density of events by time of day", x = "time of day", y = "density")
This graph is a good visual representation of the weekend/weekday leisure/commute differences. On weekdays, the registered distribution is bimodal, centered at about 8:00 and 18:00. On weekends, the registered distribution is normal, centered at about 13:00. For casual users, weekends and weekdays are both normal and centered at 15:00.
client and fill with weekday. What information does this graph tell you that the previous didn’t? Is one graph better than the other?Trips%>%
mutate(hour_num = hour(sdate),
minute_dec = minute(sdate)/60,
totaltime = hour_num + minute_dec,
day_of_week = wday(sdate, label = TRUE),
weekend = ifelse(day_of_week == "Sat" | day_of_week == "Sun", "weekend", "weekday"))%>%
ggplot(aes(x=totaltime, fill = weekend))+
geom_density(alpha = .5, color = "NA")+
facet_wrap(vars(client))+
labs(title = "Density of events by time of day", x = "time of day", y = "density")
This graph reinforces observations I made in the previous graph. On weekdays, the registered distribution is bimodal, centered at about 8:00 and 18:00. On weekends, the registered distribution is normal, centered at about 13:00. For casual users, weekends and weekdays are both normal and centered at 15:00. This graph allows the difference between casual (leisurely) use and registered (commuter) use to be seen more clearly.
Stations to make a visualization of the total number of departures from each station in the Trips data. Use either color or size to show the variation in number of departures. We will improve this plot next week when we learn about maps!Stations%>%
mutate(sstation = name)%>%
left_join(Trips,
by = "sstation")%>%
group_by(name, lat, long)%>%
count()%>%
ggplot(aes(x=long, y=lat, size = n))+
geom_point(alpha = .5)+
labs(title = "Number of departures from each rental location", x = "longitude", y= "latitude")
There is a much higher number of departures at around 38.9 latitude by -77.025 longitude. There must be a popular station in this area.
Stations%>%
left_join(Trips,
by = c("name" = "sstation"))%>%
group_by(name, lat, long)%>%
summarize(casual_percent = mean(client == "Casual"))%>%
ggplot(aes(x=long, y=lat, color=casual_percent))+
geom_point()+
labs(title = "Percent of sasual departures from each rental location", x = "longitude", y= "latitude")
From the previous graph, it looked like there was a popular station at around 38.9 latitude by -77.025 longitude. This matches pretty well with the location of stations with the highest percentage of casual rentals. Maybe there is a popular tourist site around here that attracts bike renters?
as_date(sdate) converts sdate from date-time format to date format.topTrips<-
Trips%>%
mutate(tripdate = as_date(sdate))%>%
group_by(sstation, tripdate)%>%
count()%>%
arrange(desc(n))%>%
head(10)
topTrips
Trips%>%
mutate(tripdate = as_date(sdate))%>%
inner_join(topTrips,
by = c("sstation", "tripdate"))
Trips%>%
mutate(tripdate = as_date(sdate))%>%
inner_join(topTrips,
by = c("sstation", "tripdate"))%>%
mutate(day_of_week = wday(sdate, label = TRUE))%>%
group_by(client, day_of_week)%>%
summarize(number = n())%>%
mutate(prop = number/sum(number))%>%
pivot_wider(id_cols = day_of_week,
names_from = client,
values_from = prop)
DID YOU REMEMBER TO GO BACK AND CHANGE THIS SET OF EXERCISES TO THE LARGER DATASET? IF NOT, DO THAT NOW.
This problem uses the data from the Tidy Tuesday competition this week, kids. If you need to refresh your memory on the data, read about it here.
facet_geo(). The graphic won’t load below since it came from a location on my computer. So, you’ll have to reference the original html on the moodle page to see it.kids%>%
group_by(state,year)%>%
filter(variable == "lib")%>%
summarize(total_spent = mean(inf_adj_perchild, nna.rm = TRUE))%>%
ggplot(aes(x=year, y=total_spent))+
geom_line(color = "white", size = 2)+
facet_geo(vars(state), scales="free")+
labs(title = "Change in public spending on libraries from 1997 to 2016", subtitle = "Thousands of dollars spent per child, adjusted for inflation", x = "", y = "")+
theme_void()+
theme(
plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
plot.subtitle = element_text(hjust = 0.5, size = 16),
plot.background = element_rect(fill = "#7E95A9"),
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()
)
DID YOU REMEMBER TO UNCOMMENT THE OPTIONS AT THE TOP?